Policy-based data deduplication

ABSTRACT

A data storage site receives data from different data producer sites. Each of the data producer sites has a particular relationship to the data storage site, and each particular relationship carries corresponding data storage policies, constraints and commitments. When a data storage site receives a data storage request from a data producer, and that particular data is already present from a prior storage operation at the data storage site, the characteristics of the policies, constraints and commitments that were applied when that data was saved by the prior storage operation are reconciled with the policies, constraints and commitments of the requesting data producer. Deduplication logic reconciles different sets of policies, constraints and commitments such that the data can be effectively deduplicated by saving data-producer-specific metadata. Alternatively, the data can be effectively deduplicated by promoting the storage of the data so it covers a broader set of policies, constraints and commitments.

FIELD

This disclosure relates to data deduplication, and more particularly totechniques for policy- and rule-based action reconciliation inhigh-performance data deduplication environments.

BACKGROUND

With the explosion of data, more and more techniques are needed tomanage unnecessary duplication of data items. In some deduplicationregimes, when responding to a request for storing data (e.g., a writerequest in a backup scenario), a file system or agent checks to see ifthe identical data item (e.g., the identical file or identical portionof the particular file, or the identical block of the particular file)already exists in the file system or other managed storage repository.If so, deduplication logic will prevent the data item from being storedagain, and the request for storing data is satisfied by the occurrenceof the already-stored data item—without duplicating storage of theidentical data by storing it again. Checksums or other fingerprints areused to determine uniqueness of the data item, which uniquenesscharacteristic is in turn used to determine whether or not the identicaldata item already exists in storage.

In modern computing environments, a single data repository at a centralsite can be accessed by multiple independently-operated sites (e.g.,satellite sites). Each independently operated site might have one ormore individual data owners, which in turn might have individual datastorage relationships with respect to the central site. Suchrelationships include contracts, subscriptions, commitments, and relatedpolicies such as backup frequency policies, restore point commitments,and other service level agreement provisions. For example, a first dataowner might have a policy to store its data in a “higher tier” or“highest tier” of the data repository while a second data owner mighthave a policy to store its data in a “lower tier” or “lowest tier” ofthe data repository.

Unfortunately, deduplication logic (e.g., for making decisions toreplicate a block of data or not to replicate a block of data to astorage repository) is often based merely on a fingerprint or othercharacteristic of uniqueness of the data in the block. This coarse logicis deficient. Specifically, techniques that decide not to replicate ablock of data merely based on a fingerprint or other characteristic ofuniqueness are deficient, at least in that they fail to consider othercharacteristics that might apply to the particular requestor/owner, orto the particular block or portion of the file, etc.

What is needed are techniques for deduplication that improve over theaforementioned deficiencies.

SUMMARY

The present disclosure describes techniques used in systems, methods,and in computer program products that implement storage policyreconciliation when performing data deduplication, which techniquesadvance the relevant technologies to address technological issues withlegacy approaches. More specifically, the present disclosure describestechniques used in systems, methods, and in computer program productsfor rule-level reconciliation.

The disclosed embodiments modify and improve over legacy approaches. Inparticular, the herein-disclosed techniques provide technical solutionsthat address the technical problems attendant to making decisions toreplicate a block of data or not to replicate a block of data to astorage repository. Such technical solutions relate to improvements incomputer functionality. Various applications of the herein-disclosedimprovements in computer functionality serve to reduce the demand forcomputer memory, reduce the demand for computer processing power, reducenetwork bandwidth use, and reduce the demand for inter-componentcommunication. Some embodiments disclosed herein use techniques toimprove the functioning of multiple systems within the disclosedenvironments, and some embodiments advance peripheral technical fieldsas well. As one specific example, use of the disclosed techniquesresults in storing less data than would otherwise be stored while stillhonoring demands that derive from different characteristics of differentdata producers. Storage of less data reduces the size of data cataloguesand indexes, which in turn reduces the amount of computer processingpower needed to access stored data and its metadata, which means thatstorage and retrieval systems that comport with the embodimentsdisclosed herein are more efficient than other systems.

Moreover, use of the disclosed techniques and devices within the shownenvironments as depicted in the figures provide advances in thetechnical field of data storage as well as advances in various technicalfields related to computing platform management.

Further details of aspects, objectives, and advantages of thetechnological embodiments are described herein and in the drawings andclaims.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings described below are for illustration purposes only. Thedrawings are not intended to limit the scope of the present disclosure.

FIG. 1A is a block diagram of an environment in which systems forpolicy-based data deduplication can operate.

FIG. 1B is a schematic representation of how a storage event can beprocessed through relationship-based deduplication logic so as toimplement policy-based data deduplication, according to an embodiment.

FIG. 2 is a diagram showing a data flow to implement policy-based datadeduplication, according to an embodiment.

FIG. 3 is a data flow diagram depicting use of policy metadata and arules database in a system for policy-based data deduplication,according to an embodiment.

FIG. 4 is a data flow diagram showing an event analysis flow as used insystems that perform policy-based data deduplication, according to anembodiment.

FIG. 5 is a data flow diagram showing a data item status determinationtechnique as used in systems that perform policy-based datadeduplication, according to an embodiment.

FIG. 6 is a data flow diagram showing a rule collection technique asused in system that performs policy-based data deduplication, accordingto an embodiment.

FIG. 7 is a data flow diagram showing a rule analysis technique as usedin systems that perform policy-based data deduplication, according to anembodiment.

FIG. 8 is a diagram depicting a storage instruction dispatch techniquefor use in systems that implement policy-based data deduplication,according to an embodiment.

FIG. 9A and FIG. 9B depict system components as arrangements ofcomputing modules that are interconnected so as to implement certain ofthe herein-disclosed embodiments.

FIG. 10A, FIG. 10B, and FIG. 10C depict virtualized controllerarchitectures comprising collections of interconnected componentssuitable for implementing embodiments of the present disclosure and/orfor use in the herein-described environments.

DETAILED DESCRIPTION

Embodiments in accordance with the present disclosure address theproblem of making policy-specific decisions to replicate a block orrange of data or not to replicate a block or range of data to a storagerepository. Some embodiments are directed to techniques fordeduplication that consider particular storage capabilities that are bedesired or required by a particular data owner. Example embodimentsconsider sets of storage capabilities or desires pertaining to a blockor range of data from a first site or owner that are different from theparticular storage capabilities or desires of a different site or dataowner of the identical block or range of data. The accompanying figuresand discussions herein present example environments, systems, methods,and computer program products to implement rule-based reconciliation inpolicy-based data deduplication.

Overview

Disclosed herein are techniques that are used to decide to replicate ornot to replicate a particular data item based on that data item owner'spolicies or that data item owner's rules or requirements. In some cases,a decision to replicate or not to replicate a particular data item mightbe based on a then-current status of the data item and/or the manner inwhich the data item had been stored as a result of operation of aprevious storage request. For example, a first data owner might requirethat a data item comprising “file F” (e.g., a copy of “Spiderman”) is tobe stored in a top tier of a multi-tier storage facility, whereas adifferent, second data owner might specify that its data item comprising“file F” (e.g., another identical copy of “Spiderman”) is to be storedin a lowest tier of the same multi-tier storage facility. Data can bededuplicated by not storing a physical second copy of “Spiderman” in thelowest tier and, instead, merely indicating that the physical secondcopy of “Spiderman” that would have been stored in the lowest tier canbe accessed from the copy in the top tier. If the top tier data itemthat is owned by the first data owner is ever deleted, then the copythat is stored in the lower tier is marked as owned by the second dataowner.

In accordance with embodiments as disclosed herein, determination andoperation of relationship-based storage instructions (e.g., to duplicateor not, and/or how and/or where to duplicate) serve to reduce computingresources required to serve multiple sites that have varying policies.Some embodiments issue relationship-based storage instructions to two ormore data storage repositories. As the number of satellite sites andcorresponding relationships to the data storage repositories increases,so increases the efficiency of the herein-disclosed deduplication systemas a whole.

Definitions and Use of Figures

Some of the terms used in this description are defined below for easyreference. The presented terms and their respective definitions are notrigidly restricted to these definitions—a term may be further defined bythe term's use within this disclosure. The term “exemplary” is usedherein to mean serving as an example, instance, or illustration. Anyaspect or design described herein as “exemplary” is not necessarily tobe construed as preferred or advantageous over other aspects or designs.Rather, use of the word exemplary is intended to present concepts in aconcrete fashion. As used in this application and the appended claims,the term “or” is intended to mean an inclusive “or” rather than anexclusive “or”. That is, unless specified otherwise, or is clear fromthe context, “P employs A or B” is intended to mean any of the naturalinclusive permutations. That is, if P employs A, P employs B, or Pemploys both A and B, then “P employs A or B” is satisfied under any ofthe foregoing instances. As used herein, at least one of A or B means atleast one of A, or at least one of B, or at least one of both A and B.In other words, this phrase is disjunctive. The articles “a” and “an” asused in this application and the appended claims should generally beconstrued to mean “one or more” unless specified otherwise or is clearfrom the context to be directed to a singular form.

Various embodiments are described herein with reference to the figures.It should be noted that the figures are not necessarily drawn to scaleand that elements of similar structures or functions are sometimesrepresented by like reference characters throughout the figures. Itshould also be noted that the figures are only intended to facilitatethe description of the disclosed embodiments—they are not representativeof an exhaustive treatment of all possible embodiments, and they are notintended to impute any limitation as to the scope of the claims. Inaddition, an illustrated embodiment need not portray all aspects oradvantages of usage in any particular environment.

An aspect or an advantage described in conjunction with a particularembodiment is not necessarily limited to that embodiment and can bepracticed in any other embodiments even if not so illustrated.References throughout this specification to “some embodiments” or “otherembodiments” refer to a particular feature, structure, material orcharacteristic described in connection with the embodiments as beingincluded in at least one embodiment. Thus, the appearance of the phrases“in some embodiments” or “in other embodiments” in various placesthroughout this specification are not necessarily referring to the sameembodiment or embodiments. The disclosed embodiments are not intended tobe limiting of the claims.

DESCRIPTIONS OF EXAMPLE EMBODIMENTS

FIG. 1A is a block diagram of an environment 1A00 in which systems forpolicy-based data deduplication can operate. As an option, one or morevariations of environment 1A00 or any aspect thereof may be implementedin the context of the architecture and functionality of the embodimentsdescribed herein.

The environment 1A00 includes several producer sites 102 thatcommunicate over a network 103 to a data storage site. The data storagesite might be configured to function as a disaster recovery datarepository, where any of the producer sites send their respective dataitems to the disaster recovery data repository in case some disaster orother situation raises a need for data to be restored at a producersite. In the context of forms of data deduplication that might beperformed at the data storage site, it often happens that some data “X”might exist at more than one of the producer sites. For example, and asshown, each of the producer sites might host a respective copy of thefilm “Spiderman”. In most cases, just one copy of “Spiderman” needs tobe saved at the disaster recovery data repository. This is because,since each copy of “Spiderman” is exactly identical to every other copy,in the event of a need for data recovery at any one of the producersites, any single copy of the film “Spiderman” can be used to restore“Spiderman” at the producer site.

However, it sometime happens that one site (e.g., site S1 142) has adifferent service level agreement (SLA) or different contract provisionsor different restore point objectives or other codifications of a datastorage relationship as compared to the other sites (e.g., site S2 144,site S3 146). This can happen when one or another site pays more for(for example) faster disaster recovery. It can also happen for otherreasons that can be characterized in an SLA or other relationshipinformation that is accessible to the data storage site. Continuing the“Spiderman” example, it might be that a deduplication operation at thedata storage site receives or accesses metadata pertaining to the file“Spiderman” very infrequently. Recognizing that the file is onlyinfrequently accessed, “Spiderman” becomes a good candidate to be storedat a lower tier of storage (e.g., at a cheaper, Tier2 of storage.). Ifanother site submits a request (e.g., in context of a disaster recoverybackup scenario) to store its copy of “Spiderman”, the data storage sitecan determine that a copy of “Spiderman” already exists and need not bestored again, thereby performing deduplication by not storing anothercopy of “Spiderman”.

The foregoing determination assumes that the one copy of “Spiderman” issufficient to satisfy the SLAs of all of the data producer sites. Thisassumption might be a valid assumption, or it might not be a validassumption. Suppose that the deduplication logic determined to store atthe lower tier of storage would satisfy the SLA of the requesting site,but storage of that one occurrence of “Spiderman” in the lower tier orstorage would violate another site's SLA, then that logic fordeduplication is deficient, at least because the deficient logic failsto recognize variations in SLAs or other relationships between sites.

As shown, the relationship-based deduplication action determinationlogic 114 performs a test at switch 115 to determine if there is anyvariation with respect to the manner in which some particular data is tobe stored. There could be many reasons why there is a variation in themanner in which some particular data is to be stored and, in some cases,the particular data is stored in a different manner than the manner inwhich an already existing deduplicated copy of the data exists at thecentral site 110. For example, it might happen that one of the producersites 102 (e.g., site S1 142) might have a first SLA or relationshipwith the data storage site 105 that is different from the relationshipbetween a second producer site and the data storage site. As such, eventhough there is an existing copy of “Spiderman” in the “Tier2” storage,the relationship-based deduplication action determination logic 114might determine that an additional copy of “Spiderman” needs to bestored in “Tier1” storage, and thus, the “Path2 (X2)” path is taken. Onthe other hand, if the relationship-based deduplication actiondetermination logic 114 and or the logic of switch 115 determines (e.g.,due to tenant partitioning constraints) that even though there isalready a stored instance of “Spiderman” in “Tier2” storage, thatanother instance of “Spiderman” is to be stored in the remote datastorage facility 120, then the “Path1 (X1)” path is taken. A set ofrelationship-based storage instructions 104 are delivered to the remotedata storage facility 120 such that the remote data storage facility 120will store the instance of “Spiderman” in a manner that comports withthe particular relationship and/or any applicable policies, and/orconstraints and/or commitments pertaining to the particularrelationship.

The relationship-based deduplication action determination logic 114functions based on several system aspects such as are depicted inTable 1. Specifically, the relationship-based deduplication actiondetermination logic might process a storage request based on topologywhere each site is a node of a topology graph, and/or relationshipsbetween nodes of the topology graph, and/or policies that are associatedwith any pairs of nodes and/or any rules that implement a policy orportion thereof.

TABLE 1 System Aspect Usage Topology Determines the presence of arelationship between one computing site relative to another computingsite Relationship Specifies an aspect or name between two sites PolicyName that describes a particular set of rules Rule Specifies how aparticular data item is to be handled

In the specific environment of FIG. 1A, relationship information 130 canbe received from any particular node, and then stored in a relationshipdatabase 132 at the central site, which in turn comprises a table thatincludes a relationship lookup value (e.g., a site identifier such as‘S1’, ‘S2’, etc., as shown), which relationship lookup key correspondsto a particular policy or policies. A policy is a set of rules that areto be observed when performing deduplication operations. As shown, anyparticular policy (e.g., policy P1 pertaining to site S1) is composed ofone or more rules, each of which rules are identified by a rule name(e.g., a rule named R1, a rule named R2, etc.). The rules in turn arecodified and stored so as to be analyzed by a computing process. Therelationship-based deduplication action determination logic 114 canprocess any number of sets of relationship information. As such, theshown three sets of relationship information (e.g., S1 relationshipinformation, S2 relationship information, and S3 relationshipinformation) are depicted merely for illustrative purposes so as to showhow multiple instances of the same file (e.g., having the same checksum)might be processed differently based on different relationshipinformation.

More particularly, the aforementioned relationship-based deduplicationaction determination logic 114 is capable of determining that aparticular data item is a new, unique data item that is to be stored forthe first time at the central site, or if a particular data item mightalready be stored at the central site, and thus deduplication is to beconsidered with respect to the relationship between the requestor andthe central site. In some cases, the determination as to whether or nota particular data item is already in existence at the central site canbe facilitated by a master directory 112 in which “fingerprints” orchecksums of data items are stored for access by the relationship-baseddeduplication action determination logic 114.

For example, master directory 112 might be composed of entries of“fingerprints” or checksums. If a “fingerprint” or checksum for aparticular data item exists in the master directory, then it followsthat the particular data item exists in at least one storage location ofthe central site. As particular data items are processed by the centralsite, and/or as particular data items are removed from storage at thecentral site, the master directory 112 is updated accordingly. In somecases, the master directory includes pointers and/or attributes thatcorrespond to the particular data item of each entry. As such, an access(e.g., by relationship-based deduplication action determination logic114) can result in determination of not only where the data can be foundbut also how the data is stored. An entry in the master directory 112can include any number of storage attributes of the particular dataitem. Strictly as one example, a storage attribute might indicate thatthe data can be found at the remote data storage facility 120, and/orthat the data can be found in, for example, “Tier1” storage of the localdata storage facility 118, and/or that the data can be retrieved usingany one or more agents that are interfaced to a remote data storagefacility.

Making the determination as to whether or not a particular data item isto be stored is based on not only the existence (or not) of theparticular unique data item at the central site, but also based on atleast some aspect of the aforementioned relationship as well as theexistence (or not) of the particular unique data item at the centralsite.

The foregoing discussion of FIG. 1A includes the establishment and useof several tables and/or data structures that are used in ongoingdeduplication operations. Such data structures and uses are discussed infurther detail hereunder. In particular, the shown relationship-baseddeduplication logic 114 or any variations thereof may use theaforementioned data structures to achieve various outcomes that comportwith policy-based data deduplication.

FIG. 1B is a schematic representation 1B00 showing how a storage eventcan be processed through relationship-based deduplication logic so as toimplement policy-based data deduplication. Specifically, FIG. 1B depictshow a storage event can be processed through relationship-baseddeduplication logic 114 so as to implement policy-based datadeduplication. As shown, when a storage event (e.g., store a file, storea block, etc.) is received, a determination is made as to how to processthe storage event. In one situation shown as “Do not save anotheroccurrence of data ‘X’” the logic determines that an occurrence of ‘X’(e.g., original occurrence of data ‘X’) is already stored in a storagefacility in the manner that comports with characteristics of the storageevent and/or the requestor, and thus another occurrence of data ‘X’ neednot be stored.

There are situations where data deduplication is not indicated—such aswhen a requested storage event itself, and/or characteristics of therequestor, and/or characteristics of the relationship between therequestor and the storage facility—are such that the nature orcharacteristics of an original stored occurrence of data ‘X’ does notfully satisfy the requested storage event. In some such situations, data‘X’ would need to be stored separately in the manner prescribed by therequested storage event itself, and/or by characteristics of therequestor, and/or by characteristics of the relationship between therequestor and the storage facility. For example, and as shown, anoriginal occurrence of data ‘X’ is stored in a first location (e.g.,Location1) and a copy of data ‘X’ is stored in a second location (e.g.,Location2). Strictly as one example, the first location might be within“Tier1” storage area of the storage facility and the second locationmight be within “Tier2” storage area of the storage facility.

Further, there are situations where data deduplication can beaccomplished while still honoring demands that derive from differentcharacteristics of different data producers by merely storing separatemetadata for data ‘X’—without storing a second occurrence of data ‘X’.Since metadata for any particular data item is often much smaller (e.g.,1/1000^(th) the size or 1/100^(th) the size or 1/10^(th) the size, etc.)that the data item itself, this is an efficient use of storage space.

Even still further, there are situations where the storage event isprocessed by the relationship-based deduplication logic 114 toaccommodate efficient deletion of data that is no longer in use. Onesuch situation is depicted by the “Other Action” path. Specifically,metadata associated with an occurrence of data ‘X’ is marked such thatwhen there are no referrers to data ‘X’, it can be safely deleted. Otherdeduplication possibilities for handling a data item and/or its metadataare disclosed herein.

The aforementioned relationships of a requestor to or with a storagefacility are merely one type of characteristic that can distinguish onedata producer from another data producer. Other types might includecontracts and/or provisions thereof, and/or subscriptions and/orprovisions thereof, and/or contractual commitments, and/or storagepolicies or and/or storage commitments such as backup frequencypolicies, restore point policies/commitments, storage object handlingrules, storage deduplication parameters, storage deduplication parametervalues, etc. As such, the foregoing are merely examples ofcharacteristics that might distinguish one data producer from anotherdata producer.

FIG. 2 is a diagram showing a data flow 200 to implement policy-baseddata deduplication. As an option, one or more variations of data flow200 or any aspect thereof may be implemented in the context of thearchitecture and functionality of the embodiments described herein. Thedata flow 200 or any aspect thereof may be implemented in anyenvironment.

The embodiment shown in FIG. 2 is merely one example. As shown, the dataflow 200 is composed of setup operations 202 and ongoing deduplicationoperations 204. The setup operations 202 include mechanisms to identifya computing environment that comprises a plurality of data producersites that communicate data items with a shared data storage site (step210). Such identification can arise from a given topology map or othersuch data structure, and/or from registration operations carried outbetween nodes in the computing environment, and/or using any knowntechnique. At step 205, any or all of the plurality of data producersites can be populated into a column of a data structure such as theleftmost column of the shown relationship data structure 211 ₁.

In another column of the relationship data structure 211 ₁, some or allof the data producer sites may have a corresponding policy or set ofpolicies. A policy or set of policies can be codified such as byreferring to a policy by name (e.g., “P1”, “P2”, etc.), and/or byassociating a policy name to a set of constituent rules (e.g., R1, R2,etc.). Irrespective of the mechanisms and/or techniques to populate therelationship data structure, ongoing data deduplication operations canderive policy metadata 222 directly or indirectly from the relationshipdata structure 211 ₁.

As depicted in FIG. 1A and FIG. 1B, the data producer sites operateindependently. Each data producer site might perform various backupoperations (e.g., transmitting disaster recovery data to a disasterrecovery data storage site), and any such transmission might include adata item 206 that raises a potential storage deduplication event 207.

Upon occurrence of a potential storage deduplication event, a flowcomprising a set of deduplication operations is invoked. At step 220,the event is associated, directly or indirectly, with the originator ofa block of data or range of data such that the block or range can beassociated to an originating site or owner that is in turn associatedwith any of the one or more policies that were established in the setupoperation 202. As an example, if a particular potential storagededuplication event is raised by a process of site “S1”, then byperforming a lookup operation over the relationship data structure, thepre-established association of “S1” to policy “P1” can be retrieved. Theconstituent rules of the associated policy (e.g., rule R1, rule R2) canbe retrieved in the same access. In some cases, such as is depicted inFIG. 1A, the associations can be stored in a relationship database thatis accessible to the relationship-based deduplication actiondetermination logic 114. As such, the associations can be codified as aresponse to a query. In other cases, the entire relationship datastructure is retrieved and policy metadata can be codified using anyknown technique that produces information about policies (e.g., policymetadata 222) that facilitates correlations between sites and policies,and/or processes and policies, and/or policies and rules, etc.

At step 230, aspects of the potential storage deduplication event areconsidered to determine whether or not a policy is applicable and, ifso, which policy or policies are at least potentially applicable to theevent. In many cases, when the potential storage deduplication event isdeemed to indeed be subject to consideration with respect to a policy orpolicies, then metadata for the data that is the subject of the event isretrieved. As one example, any of the data item(s) (e.g., block or rangeof blocks) that pertain to the event can have associated data itemmetadata that is delivered with the event. In some cases, metadata canbe generated and/or retrieved based on aspects of the event. Forexample, if an event pertains to block “X”, the fingerprint or checksumof block “X” can be calculated and the fingerprint or checksum can beincluded together with or as a part of data item metadata 232.

Given the data item metadata, at step 240 the status of the underlyingdata item can be determined. For example, the aforementioned fingerprintor checksum can be compared against fingerprints or checksums in masterdirectory 112. Performance of step 240 results in a data structure thatcharacterizes the then-current state of the data item (data item status233), which is used in subsequent processing. As shown, step 250 isperformed concurrent with step 240. In step 250, policy metadata 222 isanalyzed to determine a set of rules that are at least potentiallyapplicable to the previously-retrieved policies.

The data item status 233 and the set of at least potentially applicablerules 244 are made available to subsequent processing. In the shownexample, step 260 analyzes the set of potentially applicable rules 244with respect to the status of the data item. In some cases, a rule isimmediately applicable to a data item having a particular status. Forexample, if a rule states, “always store ‘hot’ data items in ‘Tier1’”,and the data item status includes a “hot” indication, then that dataitem should be stored in “Tier1”. In other cases, it can happen that arule is not definitively known to be applicable or not until all of theat least potentially applicable set of rules have been considered.(Further details pertaining to application of rules is given as shownand described as pertains to FIG. 7.)

Continuing with the discussion of FIG. 2, when the rules have beenanalyzed with respect to the status of the data item, then step 270 isentered. The applicable actions 245 are transformed into instructionsthat serve to implement the policies that pertain to the data item. Forexample, if policy “P1” includes rule “R1” to “always store ‘hot’ dataitems in ‘Tier1”’, and the subject data item “X” was deemed to be ‘hot’,then instructions having the semantics of “store data item “X” in Tier1‘” is emitted. Referring again to FIG. 1A, if the instruction, “storedata item “X” in Tier1’” is issued to local data storage facility 118,then the local data storage facility would store the data item “X” inits Tier1 storage area, and/or with its Tier1 storage characteristics.

The foregoing discussion of FIG. 2 includes discussion of sites,policies and rules. The relationships between sites and policies, andthe relationships between policies and rules, as well as exampletechniques for how to make and use such associations, are shown anddiscussed as pertains to FIG. 3.

FIG. 3 is a data flow diagram 300 depicting use of policy metadata and arules database in a system for policy-based data deduplication. As anoption, one or more variations of data flow diagram 300 or any aspectthereof may be implemented in the context of the architecture andfunctionality of the embodiments described herein. The data flow diagram300 or any aspect thereof may be implemented in any environment.

The data flow diagram 300 includes steps for performance of the shownstep 205. Step 205 (e.g., as introduced in FIG. 2) serves to establishrelationships between data producer sites and policies and/or rules. Asdepicted in FIG. 3, it does this by carrying out a sequence of steps.Specifically, and as shown, the topology of a multi-site system isdetermined (step 310). This can be accomplished using any knowntechnique. In some cases, a topology is given as a graph with nodes andedges. In other cases, topological relationships between data producersites and a corresponding one or more data storage sites are given in atable.

At step 320, the topological semantics of the foregoing topologicaldeterminations are used to identify the set of data producer sites ofthe multi-site system. Next, for each identified site, step 305 servesto correlate or establish policies that pertain to a particular one ofthe sites. In some cases, a set of policies are known to be correlatedto a particular site based on the existence of an SLA. In other cases,an administrator completes a form that assigns named policies to a site.Irrespective of the particular technique to process policies thatpertain to a particular one of the sites, a data structure such as theshown relationship data structure 2112 is populated. When all of thedata producer sites have been considered, then processing moves to stepsthat further populate the relationship data structure 2112 with rulesfor each policy.

Specifically, and as shown, step 330 serves to retrieve all or part ofrelationship data structure 2112, or step 330 serves to retrieve policymetadata 222 that is derived from relationship data structure 2112. Foreach named policy, and based on the union of the named policies that areso retrieved, any associations between a named policy and a set ofconstituent rules are determined. For example, if a named policy is“Platinum-level SLA” and the terms of the “Platinum-level SLA” include aprovision to “restore within 12 hours”, then an association between“Platinum-level SLA” and a rule such as “never store in remote datastorage facility” is established.

Codification of such rules and techniques for forming associationsbetween policies and codified rules can use any known techniques.Strictly as examples, the semantics of a rule can be codified in amarkup language such as the extensible markup language (XML). Or, insome cases, a rule is coded as a predicate test such as an “IF” clause,and the “THEN” clause can be coded as an action to be taken when thepredicate evaluates to TRUE. Step 342 is performed for each namedpolicy. The result of performance of step 342 includes formation of arules database 345. The rules database might include policy metadata 222that holds a specific association between a named policy such as “P1”and any one or more rules. In the example shown, policy “P1” includes arule “time to live (TTL) after deletion is 3 days” as well as anotherrule that specifies to “use MD5 for encryption”.

The embodiment shown in FIG. 3 is merely one example flow of setupoperations that result in correlations of particular producer sites topolicies. The foregoing setup operations need not be specific to thelevel or granularity of a site. Rather, correlations to policies mightbe formed based on particular data types (e.g., a .DOCX document, or a.MOV document, etc.) and respective data-specific policies. Or, in someembodiments, correlations to policies might be formed based oncharacteristics of an entity (e.g., an agency, a clearinghouse, etc.)and respective entity-based policies. Or, correlations to policies mightbe formed based on characteristics of an individual and/or his or herroles (e.g., a manager role, an employee role, etc.) and respectiverole-based policies. Still further, correlations to policies might beformed based on characteristics of a spending objective or performanceobjective.

When all or portions of the setup operations have been initiated and/orcompleted so as to correlate a source to one or more policies, and/orwhen all or portions of the setup operations have been initiated and/orcompleted so as to populate a database of rules and policy metadata,then incoming events raised by the sources can be analyzed with respectto such rules and policy metadata. One technique for event analysis isgiven in the following FIG. 4.

FIG. 4 is a data flow diagram showing an event analysis flow 400 as usedin systems that perform policy-based data deduplication. As an option,one or more variations of event analysis flow 400 or any aspect thereofmay be implemented in the context of the architecture and functionalityof the embodiments described herein. The event analysis flow 400 or anyaspect thereof may be implemented in any environment.

The event analysis flow 400 results in generation of metadata thatcharacterizes the event type as well as any other information that wouldbe at least potentially used for making deduplication decisions. Inprevious processing (e.g., in step 220) a potential storagededuplication event 207 and a corresponding data item are used toretrieve applicable policy metadata. Such policy metadata might or mightnot be sufficient to make downstream deduplication decisions. As such,the shown event analysis flow 400 serves to collect additionalinformation. In this embodiment, step 410 forms an event record 415based at least in part on an occurrence of a potential storagededuplication event 207. In some cases a potential storage deduplicationevent might be a storage I/O command (e.g., WRITE I/O command) and, assuch, the storage I/O command might be given in a particular format,which might not, by itself, include enough information to makedownstream deduplication decisions. Therefore, step 420 through step440, including decision 425 and the iteration loop 427 are performed soas to collect and codify data item metadata 232, which in turn is usedin making downstream deduplication decisions.

Specifically, at step 420 a table or other data item is accessed todetermine a set of characteristics that at least potentially apply tomaking downstream deduplication decisions based on the potential storagededuplication event 207. Strictly as examples, such characteristicsmight include the source of the event, the time of the event, the actionor actions that explicitly or implicitly pertain to the event, and/or aset of attributes that pertain to the data item to be considered fordeduplication. For each such retrieved characteristic, decision 425 istaken to determine if the characteristic and/or its value is at leastpotentially applicable to making downstream deduplication decisions. Ifnot, the “No” branch of decision 425 is taken. Otherwise, the “Yes”branch of decision 425 is taken and step 430 is entered to collectinformation pertaining to the characteristic of the then-currentiteration. Strictly as an example, information that might pertain to asource characteristic might be the site name or requestor's name. Asanother example, information that might pertain to a time characteristicmight be codified as a timestamp or sequence control number of theparticular even being considered. Still further, information that mightpertain to the data item itself might include a fingerprint,encryption-related information, etc. At step 440, while information suchas the foregoing is collected, the information is codified and stored asdata item metadata. The iteration loop proceeds over all of thecharacteristics that were collected in step 420. When the iteration loopexits, data item metadata 232 is ready to be presented to or fordownstream processing.

Returning again to the discussion of the potential storage deduplicationevent 207 and collection of characteristics of the event, the incomingevent might be raised by a data producer that seeks to push a data itemto the storage site (e.g., for disaster recovery purposes), or theincoming event might be raised by an agent in the storage site thatseeks to purge a data item based on expiration of a data retentionpolicy. Each of these two cases can be determined by analyzing the eventand/or any data pertaining to the event. More specifically, informationpertaining to the event might be received in or with the eventindication. For example, an event might be raised after a data producersite sends backup data to the data storage site. Such backup data mightbe sent along with, or as a part of, a message that is transmitted overa network. As another example, an event might be raised after an agentat the data storage site invokes a subroutine at the data storage site.Such a subroutine might include an explicit indication of the type ofevent (e.g., a data retention purge event indication value). Orinformation pertaining to the event might be implied based at least inpart on the name or occurrence of the invoked subroutine. In someembodiments, certain portions of metadata for the particular data itemmight be included in or with a message that raises the event. In othercases, the metadata or portions thereof for the particular data item isretrieved from any available repository, possibly from a cache.

Upon completion of the iteration loop, which coincides with theconclusion of step 230, processing is passed to downstream processing.Specifically, and as shown in FIG. 2, after completion of step 230, step240 retrieves or calculates a status indication for the data item. Sucha status indication, in combination with a set of rules pertaining tothe data item and/or its status is used to determine applicable actions245 to take. One possible technique for retrieving or calculating a dataitem status is given in FIG. 5.

FIG. 5 is a data flow diagram showing a data item status determinationtechnique 500 as used in systems that perform policy-based datadeduplication. As an option, one or more variations of data item statusdetermination technique 500 or any aspect thereof may be implemented inthe context of the architecture and functionality of the embodimentsdescribed herein. The data item status determination technique 500 orany aspect thereof may be implemented in any environment.

The embodiment shown in FIG. 5 is merely one example. As shown, the dataitem status 233 determination technique includes step 510 to access themaster directory. If the master directory includes an occurrence of arecord with status information pertaining to the particular data item,then at decision 512, the “Yes” path is taken. The data item that isidentified by a matching fingerprint or checksum is retrieved at step540. Step 550 serves to update the values and/or counts in the dataitem's corresponding data item status record.

In some embodiments, the number of uses of a particular data item (e.g.,the shown data item use count 513) is stored in a master directory. Sucha use count can be considered when deciding to delete a particular dataitem occurrence. For example, suppose that many sites had raised anevent to store “Spiderman”, it might be that only one occurrence of“Spiderman” was actually stored. Suppose further that a particular oneof those sites raised a request to delete “Spiderman”. That requestcould be satisfied so long as the other sites no longer reference“Spiderman”. To keep track of how many referrers expect to be able toretrieve “Spiderman”, a master directory might keep a correspondencebetween a particular data item (e.g., “Spiderman”) and the number ofreferrers. Only after the last referrer indicates a deletion request canthe occurrence of “Spiderman” actually be deleted.

The foregoing discussion includes operations for recording counts. Insome cases, counts can be used to determine if there are any referrers.Counts can also be used to determine which policies apply to a givendata item. Such counts can be included as a data item used count 513that is stored in the master directory (e.g., in some association to theunderlying data item), or can be stored in or with occurrences of a dataitem status record 532. When the “No” branch of decision 512 is taken,then at step 520, a data item status record is generated for a new dataitem (e.g., a new data item for which the master directory does notcontain a corresponding data item status record). Also, at step 530, thecounts pertaining to the particular instance of a data item statusrecord that was generated in step 520 are stored. Processing of step 520and step 530 occurs when decision 512 determines that the incoming dataitem does not already exist in the master directory, specifically, the“No” path of decision 512 is taken, at which time step 520 and step 530serve to generate and populate a status record for the data item. Asshown, status records are stored in the master directory 112.

Moreover, any aspects of the data item (e.g., aspects that are given inthe data item metadata) can be retrieved/stored from/in a record of themaster directory. This record (e.g., data item status record 532) of thedata item metadata is periodically updated in the master directory toreflect the then-current status as deemed by the storage site. This isshown in the example with respect to the “Age” characteristic. The “Age”characteristic can hold a value of a timestamp referring to a lastaccess (e.g., last WRITE or last DELETE). Such a timestamp and a TTL canbe used in combination with data retention rules to determine if aparticular data item should be purged (e.g., upon the expiration of aTTL or retention period). Additionally, data records as stored in themaster directory might keep track of policies that had been applied,together with their respective use counts. As such, one embodiment of adata item status record 532 might include a list of policies that havebeen applied, together with any deduplication rules or other storagerules and/or deduplication rules or other storage parameters that wereused when applying the rule.

Strictly as one example of tracking policies together with theirrespective use, consider the situation where a first site applies apolicy that includes retention of a particular data item through1/1/2055. Further consider that a second site applies a policy thatincludes retention of the same particular data item through 1/1/2054. Itcan be understood that retention through 1/1/2055 satisfies retentionthrough 1/1/2054, and thus, rather than applying a first policy forretention through 1/1/2055 and another policy for retention through1/1/2054, systems in accordance with the disclosure herein can achievededuplication by promoting the particular data item to be retainedthrough 1/1/2054 by merely referring to the data item that is stored forretention through 1/1/2055. A policy reference count might beincremented to reflect another occurrence of that policy being in forcefor that particular data item.

The foregoing is merely one example of a count being applied to aparticular policy. Other situations call for a count being applied to aparticular rule. In some cases, and as shown, an applicable rule isstored in a master directory 112 that comprises one or more data itemstatus records together with corresponding one or more parameters. Thesingle master directory 112 as shown can hold any number of data itemstatus records. Alternatively, a master directory can be formed of datathat is distributed throughout any number of nodes of a computingsystem. In some cases, a logical master directory can be formed of datathat is physically distributed across any number of computing nodes.Further, a logical master directory can be formed of data that isphysically distributed across any number of computing nodes in a storagepool that is itself a logical construction of storage areas attached toany number of computing nodes, which storage pool can be accessed as asingle large storage area comprised of contiguous storage extents.

One result of the foregoing flow derives from processing of step 540 orof step 550. Specifically, at the conclusion of processing the flow ofdata item status determination technique 500, a data item status 233 isemitted. The data item status 233 might comprise all or part of the dataitem status record. Accordingly, such a data item status can be used tocollect rules which in turn are used in making decisions pertaining topolicy-based data deduplication. One possible rule collection techniqueis given in FIG. 6.

FIG. 6 is a data flow diagram showing a rule collection technique 600 asused in system that performs policy-based data deduplication. As anoption, one or more variations of rule collection technique 600 or anyaspect thereof may be implemented in the context of the architecture andfunctionality of the embodiments described herein. The rule collectiontechnique 600 or any aspect thereof may be implemented in anyenvironment.

The rule collection technique 600 of FIG. 6 is merely one embodiment.This particular embodiment implements step 250, which step followsperformance of the foregoing step 220 and step 230 of deduplicationoperations 204. In this embodiment, an incoming occurrence of data itemmetadata 232 and the master directory 112 are used in combination tocollect all of the policies and corresponding rules that would at leastpotentially apply to the data item corresponding to the aforementionedincoming data item metadata. Step 610 determines the source of the dataitem. In some cases, it does this by accessing the data item metadataand locating a field that describes the source (e.g., data producersite) of the data item. In other cases, it does this by accessing aseparate database that relates a particular event to a respective source(e.g., data producer site).

At step 620, relationship database 132 is accessed so as to retrieve thepolicies that correspond to the identified site. Then, for each policythat corresponds to the site, a set of applicable rules are amalgamated(step 630). The amalgamated rules can include all forms of rules thatare at least potentially applicable. For example, the amalgamated rulesmight include rules that have already been applied, rules that areincluded in a policy but have not yet been applied, and/or any rulesthat are at least potentially applicable to one of the policies for theparticular site. At step 640, such an amalgamation is codified as theshown set of potentially applicable rules 244.

In some cases, two different rules might be wholly or partially inconflict with each other. Accordingly, the full set of potentiallyapplicable rules 244 are to be analyzed to identify conflicts, thenature of such conflicts (if any), and to determine how to reconcileconflicts in the context of a data deduplication system. One possiblerule analysis technique is shown and described as pertains to FIG. 7.

FIG. 7 is a data flow diagram showing a rule analysis technique 700 asused in systems that perform policy-based data deduplication. As anoption, one or more variations of rule analysis technique 700 or anyaspect thereof may be implemented in the context of the architecture andfunctionality of the embodiments described herein. The rule analysistechnique 700 or any aspect thereof may be implemented in anyenvironment.

As earlier indicated, the full set of potentially applicable rules areanalyzed to identify any potential conflicts and to determine how toreconcile such conflicts. Conflicts can arise in several situations, forexample when two rules are logically incompatible, or when two rules arelogically compatible, but have different parameter values. Of course, itis also possible that a particular full set of potentially applicablerules might include rules that are not in conflict, and/or that havealready been applied. The rule analysis technique 700 produces a set ofapplicable actions 245, which set of applicable actions is used insubsequent processing.

There are many reasons why application of a particular rule might be inconflict with another rule and/or the are many reasons why actions orstates that derive from application of one particular rule might be inconflict with actions that ensue from a different rule. The flow of FIG.7 resolves to actions that avoid persisting conflicts. Table 2 describesseveral rule resolution scenarios.

TABLE 2 Rule resolution scenarios Data Item Status (from Subject Rule orAction previously-applied rule) Determination/Action Save in “Tier1”storage Saved in “Tier1” storage No incompatibility, already being savedin “Tier1” satisfies both rules. Parameter “cold” in a rule Alreadysaved in “hot” Observe the higher-performance emits an instruction tostorage parameter, (e.g., promote to a higher save in “cold” storageperformance tier) and indicate saved in “hot” storage in the masterdirectory. Save with “21 day” Save with “3 day” retention Observe lessrestrictive parameter. retention Indicate “21 day” retention in themaster directory. Do not comingle “Tenant1” This block has already beenSave this block in a separate storage data with other data stored for“Tenant2” area for “Tenant1”. Indicate block owners ownership in themaster directory.

Any of the determinations and/or actions to be taken in a particularscenario can be codified as rule resolution actions 761 that are storedin a rule resolution database 760 that is accessible to any of theoperations of rule analysis technique 700.

The processing of rule analysis technique 700 includes iterating througheach of the potentially applicable rules. Step 720 analyses the rule ofthe current iteration against the other potentially applicable rules. Insome cases, one potentially applicable rule can be in conflict withanother potentially applicable rule. In such cases, one or another ruleis to be accepted, or one or both rules are to be modified to remove theconflict. For example, if a rule being considered in the currentiteration carries the semantics of “Save with 21 day retention” andanother one of the potentially applicable rules carries the semantics of“Save with 3 day retention”, then the rule with the semantic meaning of“Save with 21 day retention” can be selected. The other rule with thesemantic meaning of “Save with 3 day retention” can be rejected becausethe rule with the semantic meaning of “Save with 3 day retention” issatisfied by the rule with the semantic meaning of “Save with 21 dayretention”.

In some cases, a rule conflict might not become apparent until thecorresponding data item status is analyzed with respect to the rule ofthe current iteration. Such conditions are handled by processing withinstep 725. Strictly as an example, a rule with the semantics of “Do notcomingle Tenant1 data with other data owners” might be violated if asave operation were performed on the data item such that the data itemwere to be stored in a storage area where there already existed data ofother tenants besides “Tenant1”. In such a case, even though it might bepossible to deduplicate the data item (e.g., due to a data item with thesame fingerprint or checksum already being stored), to observe the rule“Do not comingle Tenant1 data with other data owners”, the data itemwould need to be stored again in a different storage area. The foregoingis merely one example. There are other situations where conflicting orpotentially conflicting rules can be resolved through other actions. Asshown, a rule resolution database 760 includes rule resolution actions761, which rule resolution actions might include resolving conflicts byapplying a priority to each rule and choosing the rule with the higherpriority. In some cases, rule conflicts cannot be resolved withoutadministrative intervention. In such cases, an error can be emitted, andan administrator can take remedial action.

In most cases, however, conflicts between rules and/or conflicts between a rule and pre-existing conditions can be resolved by the processing of step 725. Three possible paths are shown in FIG. 7. As shown, one path (same rule path 724) continues processing at decision 730 to determine if that rule had already been processed. For example, if the rule is stated as “Save in tier1 storage” and a data item with the same fingerprint or checksum had already been stored in tier1 storage, then the rule for that data item had already been performed, and need not be performed again. On the other hand, if step 720 or step 725 identifies a parameter variation between rules, then parameter variation path 726 is taken so as to resolve parameter values in accordance with the operations of step 740. Example cases for resolving parameter value variations are given in Table 2 and the corresponding discussion.
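
The three-way routing can be sketched as follows (hypothetical Python; rules and data item status are modeled here as simple dictionaries, which the disclosure does not require):

    # Hypothetical sketch of the three paths of FIG. 7: same rule
    # path 724, parameter variation path 726, incompatible path 728.
    def classify(rule: dict, status: dict) -> str:
        if rule == status:
            return "same"          # path 724: rule already performed
        if rule.keys() == status.keys():
            return "parameter"     # path 726: same rule, different values
        return "incompatible"      # path 728: consult resolution database

    assert classify({"tier": "Tier1"}, {"tier": "Tier1"}) == "same"
    assert classify({"retention_days": 21}, {"retention_days": 3}) == "parameter"
    assert classify({"tier": "Tier1"}, {"comingle": False}) == "incompatible"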

In yet another case, as indicated by incompatible path 728 and step 750, it is possible that two rules and/or a rule and a previous state are of a sufficiently incompatible nature that the rule resolution database is consulted. One or more rule resolution actions 761 can be taken so as to reconcile the incompatibility.

Any or some or all of the iterations, through decision 730 and/or step 740 and/or step 750, might include a step (step 736) to add another action into a set, so as to amalgamate a set of applicable actions. When all iterations have completed, the FOR EACH loop ends, and the amalgamated set of applicable actions 245 is provided to downstream processing.
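
The overall iteration might be sketched as below (hypothetical Python; classify, resolve_parameters, and resolve_incompatibility stand in for decision 730, step 740, and step 750, respectively):

    # Hypothetical sketch of the FOR EACH loop: each potentially
    # applicable rule is routed down one of the three paths, and any
    # resulting action is added to the amalgamated set (step 736).
    def amalgamate_actions(rules, status, classify,
                           resolve_parameters, resolve_incompatibility):
        actions = []                       # amalgamated set of applicable actions
        for rule in rules:                 # FOR EACH potentially applicable rule
            path = classify(rule, status)
            if path == "same":
                continue                   # already performed; nothing to add
            if path == "parameter":
                actions.append(resolve_parameters(rule, status))       # step 740
            else:
                actions.append(resolve_incompatibility(rule, status))  # step 750
        return actions                     # provided to downstream processing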

FIG. 8 is a diagram depicting a storage instruction dispatch technique 800 for use in systems that implement policy-based data deduplication. As an option, one or more variations of storage instruction dispatch technique 800 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The storage instruction dispatch technique 800 or any aspect thereof may be implemented in any environment.

When all of the rules and/or conditions have been deemed to be compatible, or have been reconciled, then applicable actions can be taken over the data item. In some cases, the data item is merely discarded since it can be deduplicated (e.g., not stored again). In other cases, action is taken over the data item in accordance with any/all of the compatible rules or reconciled deduplication rules or reconciled deduplication parameters. At step 810, an action is converted into a storage command or other form of instruction to be sent to any of the available storage facilities. In some cases, converting an action into a storage command or instruction is performed by mapping the THEN clause of a rule into the syntax of a storage command. For example, a rule that includes the clause, “store data item ‘X’ in ‘Tier1’” might be converted to a storage command of the form “WRITE (/dev/A, loc(X), 1, BlockStart (1025))”, where “WRITE” is a command verb, “/dev/A” specifies the storage device or facility, “loc(X)” is the location of the data item “X”, and “1” is the number of blocks to write.
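
A sketch of step 810's conversion (hypothetical Python; the command syntax follows the example above, and loc() is the hypothetical locator notation used there):

    # Hypothetical sketch: render a rule's THEN clause as a WRITE
    # storage command of the form shown above.
    def to_write_command(item: str, device: str,
                         block_count: int, block_start: int) -> str:
        return (f"WRITE ({device}, loc({item}), "
                f"{block_count}, BlockStart ({block_start}))")

    # Rule clause: store data item "X" in "Tier1" (mapped here to /dev/A)
    print(to_write_command("X", "/dev/A", 1, 1025))
    # -> WRITE (/dev/A, loc(X), 1, BlockStart (1025))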

At step 820, an action can be transformed or converted into a storage command, and the storage command is sent to the intended storage facility. In some cases, the storage facility is centralized, such as local data storage facility 118. In other cases, the storage facility may be in a remote location, such as at or within or accessed through remote data storage facility 120. The action taken and/or the effect of the action taken is indicated in the master directory. For example, at step 830, if the action taken was to WRITE the data block to remote storage, then the master directory might be updated to contain an entry that indicates that “X” has been stored in the remote facility. After the set of applicable actions has been processed, then at step 840, a response indicating status and completion of the relationship-based deduplication action is sent to the originator of the storage request.
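
Steps 820 through 840 can be sketched together (hypothetical Python; send_to_facility stands in for the storage transport, and the master directory is modeled as a plain dictionary):

    # Hypothetical sketch of dispatch: send the command (step 820),
    # record the effect in the master directory (step 830), and
    # acknowledge the originator of the storage request (step 840).
    def dispatch(command: str, item: str, facility: str,
                 master_directory: dict, send_to_facility) -> dict:
        send_to_facility(facility, command)                 # step 820
        master_directory[item] = {"stored_at": facility}    # step 830
        return {"status": "complete", "item": item}         # step 840

    directory = {}
    dispatch("WRITE (/dev/A, loc(X), 1, BlockStart (1025))", "X",
             "remote", directory, send_to_facility=lambda f, c: None)
    assert directory["X"]["stored_at"] == "remote"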

ADDITIONAL EMBODIMENTS OF THE DISCLOSURE

Additional Practical Application Examples

FIG. 9A depicts a system 9A00 as an arrangement of computing modules that are interconnected so as to operate cooperatively to implement certain of the herein-disclosed embodiments. This and other embodiments present particular arrangements of elements that, individually and/or as combined, serve to form improved technological processes that address the problem that decisions to replicate or not to replicate a block of data to a storage repository are too coarse when based merely on a fingerprint or other measure of uniqueness of the data in the block. The partitioning of system 9A00 is merely illustrative and other partitions are possible. As an option, the system 9A00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Of course, however, the system 9A00 or any operation therein may be carried out in any desired environment.

The system 9A00 comprises at least one processor and at least one memory, the memory serving to store program instructions corresponding to the operations of the system. As shown, an operation can be implemented in whole or in part using program instructions accessible by a module. The modules are connected to a communication path 9A05, and any operation can communicate with other operations over communication path 9A05. The modules of the system can, individually or in combination, perform method operations within system 9A00. Any operations performed within system 9A00 may be performed in any order unless as may be specified in the claims.

The shown embodiment implements a portion of a computer system, presented as system 9A00, comprising one or more computer processors to execute a set of program code instructions (module 9A10) and modules for accessing memory to hold program code instructions to perform: identifying a data storage site that is interfaced to a network that is configured to receive data from a plurality of producer sites (module 9A20); processing, by the data storage site, a first data item received from a first data producer site by determining a first relationship between the first data producer site and the data storage site (module 9A30); storing the first data item with a first set of storage attributes, wherein the first set of storage attributes are based at least in part on the first relationship (module 9A40); processing, by the data storage site, an exact copy of the first data item received from a second data producer site by determining a second relationship between the second data producer site and the data storage site (module 9A50); and storing the exact copy of the first data item with a second set of storage attributes, wherein the second set of storage attributes are based at least in part on the second relationship (module 9A60).
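
The flow of modules 9A20 through 9A60 can be sketched as follows (hypothetical Python; relationships are modeled as a mapping from producer site to storage attributes):

    # Hypothetical sketch of FIG. 9A: each producer's copy of the same
    # data item is stored with storage attributes derived from that
    # producer's relationship to the data storage site.
    def store_with_relationship(data_item: bytes, producer: str,
                                relationships: dict, store: list) -> None:
        attributes = relationships[producer]       # e.g., {"tier": "Tier1"}
        store.append({"data": data_item,
                      "producer": producer,
                      "attributes": attributes})

    relationships = {"producer_a": {"tier": "Tier1"},
                     "producer_b": {"tier": "Tier3"}}
    store = []
    store_with_relationship(b"block", "producer_a", relationships, store)
    store_with_relationship(b"block", "producer_b", relationships, store)
    assert store[0]["attributes"] != store[1]["attributes"]  # modules 9A40/9A60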

Variations of the foregoing may include more or fewer of the shown modules. Certain variations may perform more or fewer (or different) steps, and/or certain variations may use data elements in more, or in fewer (or different) operations.

Still further, some embodiments include variations in the operations performed, and some embodiments include variations of aspects of the data elements used in the operations.

FIG. 9B depicts a system 9B00 as an arrangement of computing modules that are interconnected so as to operate cooperatively to implement certain of the herein-disclosed embodiments. The partitioning of system 9B00 is merely illustrative and other partitions are possible. As an option, the system 9B00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Of course, however, the system 9B00 or any operation therein may be carried out in any desired environment.

The system 9B00 comprises at least one processor and at least one memory, the memory serving to store program instructions corresponding to the operations of the system. As shown, an operation can be implemented in whole or in part using program instructions accessible by a module. The modules are connected to a communication path 9B05, and any operation can communicate with other operations over communication path 9B05. The modules of the system can, individually or in combination, perform method operations within system 9B00. Any operations performed within system 9B00 may be performed in any order unless as may be specified in the claims.

The shown embodiment implements a portion of a computer system, presented as system 9B00, comprising one or more computer processors to execute a set of program code instructions (module 9B10) and modules for accessing memory to hold program code instructions to perform: identifying a data storage site that is interfaced to a network that is configured to receive data from a plurality of producer sites (module 9B20); processing, by the data storage site, a first data item received from a first data producer site by determining a first relationship between the first data producer site and the data storage site (module 9B30); storing the first data item with a first set of storage attributes, wherein the first set of storage attributes are based at least in part on the first relationship (module 9B40); processing, by the data storage site, an exact copy of the first data item received from a second data producer site by determining a second relationship between the second data producer site and the data storage site (module 9B50); determining differences between the first relationship and the second relationship (module 9B60); and not storing the exact copy of the first data item even though there are differences between the first relationship and the second relationship (module 9B70).
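
By contrast with the FIG. 9A flow, the 9B variation deduplicates the exact copy; a sketch (hypothetical Python, keying stored items by a SHA-256 fingerprint) might be:

    # Hypothetical sketch of FIG. 9B: despite differing relationships,
    # the exact copy is not stored again (module 9B70); only
    # producer-specific metadata is recorded.
    import hashlib

    def maybe_store(data_item: bytes, producer: str,
                    store: dict, metadata: list) -> bool:
        fingerprint = hashlib.sha256(data_item).hexdigest()
        metadata.append({"producer": producer, "fingerprint": fingerprint})
        if fingerprint in store:
            return False      # exact copy already present: deduplicate
        store[fingerprint] = data_item
        return True

    store, metadata = {}, []
    assert maybe_store(b"block", "producer_a", store, metadata) is True
    assert maybe_store(b"block", "producer_b", store, metadata) is False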

System Architecture Overview

Additional System Architecture Examples

FIG. 10A depicts a virtualized controller as implemented by the shown virtual machine architecture 10A00. The heretofore-disclosed embodiments, including variations of any virtualized controllers, can be implemented in distributed systems where a plurality of network-connected devices communicate and coordinate actions using inter-component messaging. Distributed systems are systems of interconnected components that are designed for, or dedicated to, storage operations as well as being designed for, or dedicated to, computing and/or networking operations. Interconnected components in a distributed system can operate cooperatively to achieve a particular objective, such as to provide high performance computing, high performance networking capabilities, and/or high performance storage and/or high capacity storage capabilities. For example, a first set of components of a distributed computing system can coordinate to efficiently use a set of computational or compute resources, while a second set of components of the same distributed storage system can coordinate to efficiently use a set of data storage facilities.

A hyperconverged system coordinates the efficient use of compute and storage resources by and between the components of the distributed system. Adding a hyperconverged unit to a hyperconverged system expands the system in multiple dimensions. As an example, adding a hyperconverged unit to a hyperconverged system can expand the system in the dimension of storage capacity while concurrently expanding the system in the dimension of computing capacity and also in the dimension of networking bandwidth. Components of any of the foregoing distributed systems can comprise physically and/or logically distributed autonomous entities.

Physical and/or logical collections of such autonomous entities can sometimes be referred to as nodes. In some hyperconverged systems, compute and storage resources can be integrated into a unit of a node. Multiple nodes can be interrelated into an array of nodes, which nodes can be grouped into physical groupings (e.g., arrays) and/or into logical groupings or topologies of nodes (e.g., spoke-and-wheel topologies, rings, etc.). Some hyperconverged systems implement certain aspects of virtualization. For example, in a hypervisor-assisted virtualization environment, certain of the autonomous entities of a distributed system can be implemented as virtual machines. As another example, in some virtualization environments, autonomous entities of a distributed system can be implemented as executable containers. In some systems and/or environments, hypervisor-assisted virtualization techniques and operating system virtualization techniques are combined.

As shown, virtual machine architecture 10A00 comprises a collection of interconnected components suitable for implementing embodiments of the present disclosure and/or for use in the herein-described environments. Moreover, virtual machine architecture 10A00 includes a virtual machine instance in configuration 1051 that is further described as pertaining to controller virtual machine instance 1030. Configuration 1051 supports virtual machine instances that are deployed as user virtual machines, or controller virtual machines, or both. Such virtual machines interface with a hypervisor (as shown). Some virtual machines include processing of storage I/O (input/output or IO) as received from any or every source within the computing platform. An example implementation of such a virtual machine that processes storage I/O is depicted as 1030.

In this and other configurations, a controller virtual machine instance receives block I/O (input/output or IO) storage requests as network file system (NFS) requests in the form of NFS requests 1002, and/or internet small computer storage interface (iSCSI) block IO requests in the form of iSCSI requests 1003, and/or Samba file system (SMB) requests in the form of SMB requests 1004. The controller virtual machine (CVM) instance publishes and responds to an internet protocol (IP) address (e.g., CVM IP address 1010). Various forms of input and output (I/O or IO) can be handled by one or more IO control handler functions (e.g., IOCTL handler functions 1008) that interface to other functions such as data IO manager functions 1014 and/or metadata manager functions 1022. As shown, the data IO manager functions can include communication with virtual disk configuration manager 1012 and/or can include direct or indirect communication with any of various block IO functions (e.g., NFS IO, iSCSI IO, SMB IO, etc.).

In addition to block IO functions, configuration 1051 supports IO of any form (e.g., block IO, streaming IO, packet-based IO, HTTP traffic, etc.) through either or both of a user interface (UI) handler such as UI IO handler 1040 and/or through any of a range of application programming interfaces (APIs), possibly through API IO manager 1045.

Communications link 1015 can be configured to transmit (e.g., send, receive, signal, etc.) any type of communications packets comprising any organization of data items. The data items can comprise payload data, a destination address (e.g., a destination IP address), and a source address (e.g., a source IP address), and can include various packet processing techniques (e.g., tunneling), encodings (e.g., encryption), and/or formatting of bit fields into fixed-length blocks or into variable length fields used to populate the payload. In some cases, packet characteristics include a version identifier, a packet or payload length, a traffic class, a flow label, etc. In some cases, the payload comprises a data structure that is encoded and/or formatted to fit into byte or word boundaries of the packet.

In some embodiments, hard-wired circuitry may be used in place of, or in combination with, software instructions to implement aspects of the disclosure. Thus, embodiments of the disclosure are not limited to any specific combination of hardware circuitry and/or software. In embodiments, the term “logic” shall mean any combination of software or hardware that is used to implement all or part of the disclosure.

The term “computer readable medium” or “computer usable medium” as used herein refers to any medium that participates in providing instructions to a data processor for execution. Such a medium may take many forms including, but not limited to, non-volatile media and volatile media. Non-volatile media includes any non-volatile storage medium, for example, solid state storage devices (SSDs) or optical or magnetic disks such as disk drives or tape drives. Volatile media includes dynamic memory such as random access memory. As shown, controller virtual machine instance 1030 includes content cache manager facility 1016 that accesses storage locations, possibly including local dynamic random access memory (DRAM) (e.g., through the local memory device access block 1018) and/or possibly including accesses to local solid state storage (e.g., through local SSD device access block 1020).

Common forms of computer readable media include any non-transitory computer readable medium, for example, floppy disk, flexible disk, hard disk, magnetic tape, or any other magnetic medium; CD-ROM or any other optical medium; punch cards, paper tape, or any other physical medium with patterns of holes; or any RAM, PROM, EPROM, FLASH-EPROM, or any other memory chip or cartridge. Any data can be stored, for example, in any form of external data repository 1031, which in turn can be formatted into any one or more storage areas, and which can comprise parameterized storage accessible by a key (e.g., a filename, a table name, a block address, an offset address, etc.). External data repository 1031 can store any forms of data, and may comprise a storage area dedicated to storage of metadata pertaining to the stored forms of data. In some cases, metadata can be divided into portions. Such portions and/or cache copies can be stored in the external storage data repository and/or in a local storage area (e.g., in local DRAM areas and/or in local SSD areas). Such local storage can be accessed using functions provided by local metadata storage access block 1024. External data repository 1031 can be configured using CVM virtual disk controller 1026, which can in turn manage any number or any configuration of virtual disks.

Execution of the sequences of instructions to practice certain embodiments of the disclosure is performed by one or more instances of a software instruction processor, or a processing element such as a data processor, or such as a central processing unit (e.g., CPU1, CPU2, CPUN). According to certain embodiments of the disclosure, two or more instances of configuration 1051 can be coupled by communications link 1015 (e.g., backplane, LAN, PSTN, wired or wireless network, etc.) and each instance may perform respective portions of sequences of instructions as may be required to practice embodiments of the disclosure.

The shown computing platform 1006 is interconnected to the Internet 1048 through one or more network interface ports (e.g., network interface port 1023₁ and network interface port 1023₂). Configuration 1051 can be addressed through one or more network interface ports using an IP address. Any operational element within computing platform 1006 can perform sending and receiving operations using any of a range of network protocols, possibly including network protocols that send and receive packets (e.g., network protocol packet 1021₁ and network protocol packet 1021₂).

Computing platform 1006 may transmit and receive messages that can be composed of configuration data and/or any other forms of data and/or instructions organized into a data structure (e.g., communications packets). In some cases, the data structure includes program code instructions (e.g., application code) communicated through the Internet 1048 and/or through any one or more instances of communications link 1015. Received program code may be processed and/or executed by a CPU as it is received and/or program code may be stored in any volatile or non-volatile storage for later execution. Program code can be transmitted via an upload (e.g., an upload from an access device over the Internet 1048 to computing platform 1006). Further, program code and/or the results of executing program code can be delivered to a particular user via a download (e.g., a download from computing platform 1006 over the Internet 1048 to an access device).

Configuration 1051 is merely one sample configuration. Other configurations or partitions can include further data processors, and/or multiple communications interfaces, and/or multiple storage devices, etc. within a partition. For example, a partition can bound a multi-core processor (e.g., possibly including embedded or collocated memory), or a partition can bound a computing cluster having a plurality of computing elements, any of which computing elements are connected directly or indirectly to a communications link. A first partition can be configured to communicate to a second partition. A particular first partition and a particular second partition can be congruent (e.g., in a processing element array) or can be different (e.g., comprising disjoint sets of components).

A cluster is often embodied as a collection of computing nodes that can communicate between each other through a local area network (e.g., LAN or virtual LAN (VLAN)) or a backplane. Some clusters are characterized by assignment of a particular set of the aforementioned computing nodes to access a shared storage facility that is also configured to communicate over the local area network or backplane. In many cases, the physical bounds of a cluster are defined by a mechanical structure such as a cabinet or such as a chassis or rack that hosts a finite number of mounted-in computing units. A computing unit in a rack can take on a role as a server, or as a storage unit, or as a networking unit, or any combination therefrom. In some cases, a unit in a rack is dedicated to provisioning of power to other units. In some cases, a unit in a rack is dedicated to environmental conditioning functions such as filtering and movement of air through the rack and/or temperature control for the rack. Racks can be combined to form larger clusters. For example, the LAN of a first rack having 32 computing nodes can be interfaced with the LAN of a second rack having 16 nodes to form a two-rack cluster of 48 nodes. The former two LANs can be configured as subnets, or can be configured as one VLAN. Multiple clusters can communicate with one another over a WAN (e.g., when geographically distal) or a LAN (e.g., when geographically proximal).

A module as used herein can be implemented using any mix of any portions of memory and any extent of hard-wired circuitry including hard-wired circuitry embodied as a data processor. Some embodiments of a module include one or more special-purpose hardware components (e.g., power control, logic, sensors, transducers, etc.). A data processor can be organized to execute a processing entity that is configured to execute as a single process or configured to execute using multiple concurrent processes to perform work. A processing entity can be hardware-based (e.g., involving one or more cores) or software-based, and/or can be formed using a combination of hardware and software that implements logic, and/or can carry out computations and/or processing steps using one or more processes and/or one or more tasks and/or one or more threads or any combination thereof.

Some embodiments of a module include instructions that are stored in a memory for execution so as to facilitate operational and/or performance characteristics pertaining to policy-based data deduplication. In some embodiments, a module may include one or more state machines and/or combinational logic used to implement or facilitate the operational and/or performance characteristics pertaining to policy-based data deduplication.

Various implementations of the data repository comprise storage media organized to hold a series of records or files such that individual records or files are accessed using a name or key (e.g., a primary key or a combination of keys and/or query clauses). Such files or records can be organized into one or more data structures (e.g., data structures used to implement or facilitate aspects of policy-based data deduplication). Such files or records can be brought into and/or stored in volatile or non-volatile memory. More specifically, the occurrence and organization of the foregoing files, records, and data structures improve the way that the computer stores and retrieves data in memory, for example, to improve the way data is accessed when the computer is performing operations pertaining to policy-based data deduplication, and/or for improving the way data is manipulated when performing computerized operations pertaining to techniques for deduplication that consider particular storage capabilities that might be desired or required by a particular data owner.

Further details regarding general approaches to managing deduplicated data are described in U.S. patent application Ser. No. 15/459,706 titled “MANAGING DEDUPLICATED DATA”, filed on Mar. 15, 2017, which is hereby incorporated by reference in its entirety.

Further details regarding general approaches to managing data repositories are described in U.S. Pat. No. 8,601,473 titled “ARCHITECTURE FOR MANAGING I/O AND STORAGE FOR A VIRTUALIZATION ENVIRONMENT”, issued on Dec. 3, 2013, which is hereby incorporated by reference in its entirety.

Further details regarding general approaches to managing and maintaining data in data repositories are described in U.S. Pat. No. 8,549,518 titled “METHOD AND SYSTEM FOR IMPLEMENTING MAINTENANCE SERVICE FOR MANAGING I/O AND STORAGE FOR A VIRTUALIZATION ENVIRONMENT”, issued on Oct. 1, 2013, which is hereby incorporated by reference in its entirety.

FIG. 10B depicts a virtualized controller implemented by containerized architecture 10B00. The containerized architecture comprises a collection of interconnected components suitable for implementing embodiments of the present disclosure and/or for use in the herein-described environments. Moreover, the shown containerized architecture 10B00 includes an executable container instance in configuration 1052 that is further described as pertaining to the executable container instance 1050. Configuration 1052 includes an operating system layer (as shown) that performs addressing functions such as providing access to external requestors via an IP address (e.g., “P.Q.R.S”, as shown). Providing access to external requestors can include implementing all or portions of a protocol specification (e.g., “http:”) and possibly handling port-specific functions.

The operating system layer can perform port forwarding to any executable container (e.g., executable container instance 1050). An executable container instance can be executed by a processor. Runnable portions of an executable container instance sometimes derive from an executable container image, which in turn might include all, or portions of any of, a Java archive repository (JAR) and/or its contents, and/or a script or scripts and/or a directory of scripts, and/or a virtual machine configuration, and may include any dependencies therefrom. In some cases, a configuration within an executable container might include an image comprising a minimum set of runnable code. Contents of larger libraries and/or code or data that would not be accessed during runtime of the executable container instance can be omitted from the larger library to form a smaller library composed of only the code or data that would be accessed during runtime of the executable container instance. In some cases, start-up time for an executable container instance can be much faster than start-up time for a virtual machine instance, at least inasmuch as the executable container image might be much smaller than a respective virtual machine instance. Furthermore, start-up time for an executable container instance can be much faster than start-up time for a virtual machine instance, at least inasmuch as the executable container image might have many fewer code and/or data initialization steps to perform than a respective virtual machine instance.

An executable container instance (e.g., a Docker container instance) can serve as an instance of an application container. Any executable container of any sort can be rooted in a directory system, and can be configured to be accessed by file system commands (e.g., “ls” or “ls -a”, etc.). The executable container might optionally include operating system components 1078; however, such a separate set of operating system components need not be provided. As an alternative, an executable container can include runnable instance 1058, which is built (e.g., through compilation and linking, or just-in-time compilation, etc.) to include all of the library and OS-like functions needed for execution of the runnable instance. In some cases, a runnable instance can be built with a virtual disk configuration manager, any of a variety of data IO management functions, etc. In some cases, a runnable instance includes code for, and access to, container virtual disk controller 1076. Such a container virtual disk controller can perform any of the functions that the aforementioned CVM virtual disk controller 1026 can perform, yet such a container virtual disk controller does not rely on a hypervisor or any particular operating system so as to perform its range of functions.

In some environments, multiple executable containers can be collocated and/or can share one or more contexts. For example, multiple executable containers that share access to a virtual disk can be assembled into a pod (e.g., a Kubernetes pod). Pods provide sharing mechanisms (e.g., when multiple executable containers are amalgamated into the scope of a pod) as well as isolation mechanisms (e.g., such that the namespace scope of one pod does not share the namespace scope of another pod).

FIG. 10C depicts a virtualized controller implemented by a daemon-assisted containerized architecture 10C00. The containerized architecture comprises a collection of interconnected components suitable for implementing embodiments of the present disclosure and/or for use in the herein-described environments. Moreover, the shown instance of daemon-assisted containerized architecture 10C00 includes a user executable container instance in configuration 1053 that is further described as pertaining to user executable container instance 1080. Configuration 1053 includes a daemon layer (as shown) that performs certain functions of an operating system.

User executable container instance 1080 comprises any number of user containerized functions (e.g., user containerized function1, user containerized function2, . . . , user containerized functionN). Such user containerized functions can execute autonomously, or can be interfaced with or wrapped in a runnable object to create a runnable instance (e.g., runnable instance 1058). In some cases, the shown operating system components 1078 comprise portions of an operating system, which portions are interfaced with or included in the runnable instance and/or any user containerized functions. In this embodiment of a daemon-assisted containerized architecture, the computing platform 1006 might or might not host operating system components other than operating system components 1078. More specifically, the shown daemon might or might not host operating system components other than operating system components 1078 of user executable container instance 1080.

In the foregoing specification, the disclosure has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure. For example, the above-described process flows are described with reference to a particular ordering of process actions. However, the ordering of many of the described process actions may be changed without affecting the scope or operation of the disclosure. The specification and drawings are to be regarded in an illustrative sense rather than in a restrictive sense.

1. A method comprising: identifying a first data producer site that communicates with a data storage site, the first data producer site associated with a first deduplication parameter; identifying a second data producer site that communicates with the data storage site, the second data producer site associated with a second deduplication parameter, wherein the first deduplication parameter is different from the second deduplication parameter; receiving, from the first data producer site, a subject data item to store at the data storage site based at least in part on the first deduplication parameter; receiving, from the second data producer site, a second occurrence of the subject data item to store at the data storage site based at least in part on the second deduplication parameter; and reconciling the first deduplication parameter against the second deduplication parameter to determine a third deduplication parameter to apply to storage of the second occurrence of the subject data item at the data storage site.

2. The method of claim 1, further comprising storing the second occurrence of the subject data item at the data storage site according to the third deduplication parameter.

3. The method of claim 1, further comprising storing metadata pertaining to the second occurrence of the subject data item without storing the second occurrence of the subject data item at the data storage site.

4. The method of claim 3, wherein the metadata comprises at least one aspect of the third deduplication parameter.

5. The method of claim 1, wherein the data storage site comprises a data storage facility having two or more tiers of storage, at least a first tier of the two or more tiers of storage having different storage attributes than at least a second tier of the two or more tiers of storage.

6. The method of claim 5, wherein the data storage facility is configured to store the subject data item in the first tier and wherein the data storage facility is configured to store the second occurrence of the subject data item in the second tier.

7. The method of claim 1, wherein the data storage site is further interfaced to remote data storage.

8. The method of claim 1, wherein at least one of a first data storage relationship with the data storage site or a second data storage relationship with the data storage site is associated with a storage policy.

9. The method of claim 8, wherein a first storage policy is associated with a first storage rule, and wherein a second storage policy is associated with a second storage rule that is different from the first storage rule.

10. The method of claim 9, wherein the first storage rule comprises a first storage parameter value, and wherein the second storage rule comprises a second storage parameter value.

11. The method of claim 10, wherein the first storage parameter value is different from the second storage parameter value.

12. A system comprising: one or more processors; and a memory storing instructions that, when executed by the one or more processors, cause the system to perform acts of: identifying a data storage site; identifying a first data producer site that communicates with a data storage site, the first data producer site associated with a first deduplication parameter; identifying a second data producer site that communicates with the data storage site, the second data producer site associated with a second deduplication parameter, wherein the first deduplication parameter is different from the second deduplication parameter; receiving, from the first data producer site, a subject data item to store at the data storage site based at least in part on the first deduplication parameter; receiving, from the second data producer site, a second occurrence of the subject data item to store at the data storage site based at least in part on the second deduplication parameter; and reconciling the first deduplication parameter against the second deduplication parameter to determine a third deduplication parameter to apply to storage of the second occurrence of the subject data item at the data storage site.

13. The system of claim 12, wherein execution of the instructions causes the system to perform storing the second occurrence of the subject data item at the data storage site according to the third deduplication parameter.

14. The system of claim 12, wherein execution of the instructions causes the system to perform storing metadata pertaining to the second occurrence of the subject data item without storing the second occurrence of the subject data item at the data storage site.

15. The system of claim 14, wherein the metadata comprises at least one aspect of the third deduplication parameter.

16. The system of claim 12, wherein execution of the instructions causes the system to access two or more tiers of storage.

17. A non-transitory computer readable medium comprising instructions that, when executed by one or more processors of a computer, cause the computer to carry out a method comprising: identifying a first data producer site that communicates with a data storage site, the first data producer site associated with a first deduplication parameter; identifying a second data producer site that communicates with the data storage site, the second data producer site associated with a second deduplication parameter, wherein the first deduplication parameter is different from the second deduplication parameter; receiving, from the first data producer site, a subject data item to store at the data storage site based at least in part on the first deduplication parameter; receiving, from the second data producer site, a second occurrence of the subject data item to store at the data storage site based at least in part on the second deduplication parameter; and reconciling the first deduplication parameter against the second deduplication parameter to determine a third deduplication parameter to apply to storage of the second occurrence of the subject data item at the data storage site.

18. The non-transitory computer readable medium of claim 17, wherein execution of the instructions causes the computer to perform storing the second occurrence of the subject data item at the data storage site according to the third deduplication parameter.

19. The non-transitory computer readable medium of claim 17, wherein execution of the instructions causes the computer to perform storing metadata pertaining to the second occurrence of the subject data item without storing the second occurrence of the subject data item at the data storage site.

20. The non-transitory computer readable medium of claim 19, wherein the metadata comprises at least one aspect of the third deduplication parameter.